Red and White Wine Analysis by Jawsem Al-Hashash

Introduction

For this analysis we will be using a dataset that contains 6497 wines, with 11 variables describing the chemical properties of each wine as well as a variable for the wine’s color (Red or White). It also includes an output variable called quality, for which at least 3 experts rated the wine on a scale from 0 to 10. There are 1599 red and 4898 white wines in the data set.

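The main question we are trying to answer with this analysis is which chemical properties are associated with wine quality, and whether they can be used to model it.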

Univariate Plots Section

## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ color               : chr  "White" "White" "White" "White" ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         color          
##  Min.   :3.000   Length:6497       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.818                     
##  3rd Qu.:6.000                     
##  Max.   :9.000

The tables above show the structure and summary statistics of the data. Since quality is going to be our dependent variable for the analysis, we plotted a simple histogram of the occurrences of each quality score.

We then did a side-by-side comparison of quality between the two colors of wine. Red wines have a mode of 5 versus a mode of 6 for white wines. Next we scale the histograms to percentages so we can compare them more easily.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The two histograms above are the same as the ones before but scaled to a percentage of total observations by color, which makes the differences easier to see. It is now clear that red wine has a larger share of 5s than white wine. About 80 percent of the scores for reds are 5 or 6, while white wine is more spread out and has more 7s.

We also took summary statistics of quality for each color (white first, then red) so we could see the differences. Both have the same median (6), but the mean is slightly higher for white wines (5.878 versus 5.636).

Above is a frequency plot of white wine quality and red wine quality. We have again scaled the values to a percentage of the number of wines of that color in our data set. It is similar to the previous histograms, but it lets us put both colors on the same plot with the same scale. Once again it seems that white wines are of slightly higher quality than reds, although the two curves follow each other fairly closely. Let's test whether white wines are, on average, rated higher than reds.
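As a rough illustration of the scaling described above, the sketch below converts counts per quality score into percentages within each color. The rows are made-up toy values, not the wine data; only the names wine_df, color and quality follow this report.

```r
# Toy rows (made up) standing in for the real data set.
wine_df <- data.frame(
  color   = c(rep("Red", 4), rep("White", 6)),
  quality = c(5L, 5L, 6L, 7L, 5L, 6L, 6L, 6L, 7L, 7L)
)

# Counts of each quality score, one row per color.
counts <- table(wine_df$color, wine_df$quality)

# margin = 1 divides each row by its own total, so every color sums to 100%.
pct <- 100 * prop.table(counts, margin = 1)
round(pct, 1)
```

Because each row is normalised separately, the two colors become comparable even though there are about three times as many whites as reds.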

## 
##  Two Sample t-test
## 
## data:  subset(wine_df, color == "White")$quality and subset(wine_df, color == "Red")$quality
## t = 9.6856, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.2008028       Inf
## sample estimates:
## mean of x mean of y 
##  5.877909  5.636023

We did a simple one-sided two-sample t-test. Note that the output above is the pooled-variance (Student) version rather than Welch's test: the header reads "Two Sample t-test" and df = 6495 = 6497 - 2. With this test we make the following assumptions:

  1. The quality scores for both red and white wines are normally distributed.
  2. The variances of the two sets of quality scores (red and white) are equal.

The p-value for this test is very low (less than 2.2e-16).

This means that we can reject the null hypothesis that red and white wines are rated the same on average, and accept the alternative that white wines are rated higher than reds on average.

Note this only applies to this dataset. Perhaps there are some chemical properties that white wines have that are different than red wines that are causing this small difference.
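For reference, here is a minimal sketch of how the pooled and Welch versions of this test differ in R. The ratings are simulated (the group sizes, means and standard deviations are arbitrary assumptions), so the numbers will not match the output above.

```r
set.seed(1)
# Simulated ratings, NOT the wine data: two groups with a small mean shift.
white_q <- rnorm(200, mean = 5.9, sd = 0.9)
red_q   <- rnorm(80,  mean = 5.6, sd = 0.8)

# var.equal = TRUE gives the pooled "Two Sample t-test" with n1 + n2 - 2 df.
pooled <- t.test(white_q, red_q, alternative = "greater", var.equal = TRUE)

# The default (var.equal = FALSE) is Welch's test, which does not assume
# equal variances and uses an approximate, usually non-integer, df.
welch <- t.test(white_q, red_q, alternative = "greater")

pooled$parameter  # 200 + 80 - 2 = 278 degrees of freedom
welch$parameter
```

With group sizes as unequal as 4898 and 1599, Welch's version is often the safer default, since it does not rely on assumption 2.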

The plots above are histograms for all chemical properties of the wines. It appears that volatile.acidity, residual.sugar and alcohol are all right skewed (long tail toward high values), which agrees with the summary table, where each has a mean above its median. Citric acid, pH and density look more normally distributed.

Above are attempted transformations of total.sulfur.dioxide. We tried a log10 scale, a square root scale and a cube root scale. Each transformed histogram looks more left skewed than the original.

Above are boxplots for all the chemical properties. These plots give us a general idea of what the data for each metric looks like and what to look at going forward. It looks like alcohol, density and total.sulfur.dioxide have data that is more concentrated around the median with fewer outliers. fixed.acidity, sulphates, volatile.acidity and chlorides look like they have more outliers.
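One way to check the effect of these transformations numerically, rather than by eye, is to compare a skewness statistic before and after. The sketch below uses made-up positive values in place of total.sulfur.dioxide, and skewness() is a hypothetical helper defined here, not a function from a package used in this report.

```r
# Made-up positive values standing in for total.sulfur.dioxide.
x <- c(6, 30, 77, 97, 118, 132, 156, 186, 440)

transformed <- data.frame(
  raw   = x,
  log10 = log10(x),
  sqrt  = sqrt(x),
  cbrt  = x^(1 / 3)   # cube root; valid because x is positive
)

# Moment-based sample skewness (hypothetical helper).
# Positive values mean a longer right tail, negative a longer left tail.
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

sapply(transformed, skewness)
```

A transformation that pushes the statistic from near zero to clearly negative has over-corrected, which is what the histograms above suggest.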

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800
##    Low Medium   High 
##   2120   2393   1984
##    Low Medium   High 
##   2832   2813    852

## # A tibble: 3 x 2
##   alcohol.level quality
##   <fct>           <dbl>
## 1 Low              5.46
## 2 Medium           5.96
## 3 High             6.57
## # A tibble: 3 x 2
##   sugar.level quality
##   <fct>         <dbl>
## 1 Low            5.78
## 2 Medium         5.91
## 3 High           5.75

We created some categorical variables for sugar and alcohol levels by splitting them up into Low, Medium and High. The two plots above are stacked histograms for each of these variables.

You can see that the higher alcohol content wines in general have higher quality values (the blue on the histogram is concentrated at 6 and above), while the high residual sugar wines are more concentrated at 5 and 6 quality.

You can also see this in the summary tables provided. Mean quality rises steadily with alcohol level (5.46, 5.96 and 6.57 for Low, Medium and High), while it varies little across sugar levels (5.78, 5.91 and 5.75).

Univariate Analysis

What is the structure of your dataset?

There are 6497 wines in this data set. Originally there were two data sets, one for red and one for white wine, but they were combined and a variable was added to differentiate them.
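A minimal sketch of that combining step is shown below. The two small data frames and the names red_df and white_df are hypothetical stand-ins for the original files, with made-up values.

```r
# Hypothetical stand-ins for the two source tables; the values are made up.
red_df   <- data.frame(alcohol = c(9.4, 9.8, 10.5),
                       quality = c(5L, 5L, 6L))
white_df <- data.frame(alcohol = c(8.8, 9.5, 10.1, 11.0),
                       quality = c(6L, 6L, 6L, 7L))

# Tag each row with its color before stacking the two tables.
red_df$color   <- "Red"
white_df$color <- "White"

# rbind() needs matching column names in both data frames, which holds here.
wine_df <- rbind(white_df, red_df)

str(wine_df)
table(wine_df$color)
```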

There are 11 chemical properties of the wine each with a numeric value (see below.)

  • 1 - fixed.acidity (tartaric acid - g / dm^3)
  • 2 - volatile.acidity (acetic acid - g / dm^3)
  • 3 - citric.acid (g / dm^3)
  • 4 - residual.sugar (g / dm^3)
  • 5 - chlorides (sodium chloride - g / dm^3)
  • 6 - free.sulfur.dioxide (mg / dm^3)
  • 7 - total.sulfur.dioxide (mg / dm^3)
  • 8 - density (g / cm^3)
  • 9 - pH
  • 10 - sulphates (potassium sulphate - g / dm^3)
  • 11 - alcohol (% by volume)

There is also an output variable (Quality) which is a score between 0 and 10 which is the median of at least 3 evaluations by wine experts.

There is also a variable that indicates the type of wine called color which can be Red or White.

Finally, two categorical variables were added based on the alcohol and residual sugar variables: alcohol.level and sugar.level. These are both factors with 3 levels (Low, Medium and High).

alcohol.level

  • Low: alcohol values from 7 to 10
  • Medium: alcohol values from 10 to 12
  • High: alcohol values greater than 12

sugar.level

  • Low: residual sugar levels from 0 to 2
  • Medium: residual sugar levels from 2 to 7
  • High: residual sugar levels greater than 7
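The binning above can be sketched with cut(). The rows below are made up, and the handling of values that fall exactly on a boundary (10, 12, 2 or 7) is an assumption, since the ranges listed do not say which bin they belong to.

```r
# Made-up rows; only the column names follow the data set.
wine_df <- data.frame(
  alcohol        = c(8.8, 10.0, 10.1, 12.0, 12.5),
  residual.sugar = c(1.6, 2.0, 6.9, 7.0, 20.7),
  quality        = c(5, 5, 6, 6, 7)
)

# right = TRUE (the default) builds intervals like (7, 10], so a value of
# exactly 10 lands in "Low". This edge handling is an assumption.
wine_df$alcohol.level <- cut(wine_df$alcohol,
                             breaks = c(7, 10, 12, Inf),
                             labels = c("Low", "Medium", "High"))
wine_df$sugar.level <- cut(wine_df$residual.sugar,
                           breaks = c(0, 2, 7, Inf),
                           labels = c("Low", "Medium", "High"))

# Counts per level, then mean quality per level.
table(wine_df$alcohol.level)
tapply(wine_df$quality, wine_df$alcohol.level, mean)
```

The tapply() call is a base R equivalent of the grouped summary tables shown earlier.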

What is/are the main feature(s) of interest in your dataset?

The main features of interest in our dataset are residual sugar, alcohol and quality. We want to use the combination of the 2 variables to see if we can create a model that determines the quality of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Along with residual sugar and alcohol, some of the other chemical properties could also be used for our analysis. The ones that pique my interest are pH and total sulfur dioxide, as I suspect those might also have an effect.

Did you create any new variables from existing variables in the dataset?

Yes, we created alcohol.level and sugar.level based on the alcohol and residual.sugar variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

We attempted some transformations on total.sulfur.dioxide to see if we could get a more normal distribution. We tried log10, square root and cube root. Each one gave a more left skewed plot.

Bivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  wine_df$residual.sugar and wine_df$alcohol
## t = -31.04, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3804069 -0.3380525
## sample estimates:
##        cor 
## -0.3594148
## 
##  Pearson's product-moment correlation
## 
## data:  wine_df$alcohol and wine_df$quality
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

The first graph above is residual sugar vs alcohol content. A lot of values are concentrated at low residual sugar levels, and as residual sugar increases the alcohol content gets lower, indicating a negative correlation (r = -0.36).

The second graph is alcohol vs quality. As alcohol content rises the quality rises, indicating a positive correlation between alcohol content and quality (r = 0.44).

## 
##  Pearson's product-moment correlation
## 
## data:  wine_df$quality and wine_df$residual.sugar
## t = -2.9824, df = 6495, p-value = 0.002871
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06124221 -0.01267509
## sample estimates:
##         cor 
## -0.03698048

The first graph above is residual sugar vs quality. There is no strong correlation either way for these two attributes: the estimate is r = -0.04, which is statistically significant only because the sample is so large.

In the next two graphs we plotted pH against two acidity metrics, citric.acid and fixed.acidity. There is a slightly negative correlation between pH and each of these metrics, which makes sense since lower pH levels are more acidic.
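The correlation outputs in this section come from cor.test(). The sketch below runs it on simulated data (the slope of 0.4 and the sample size are arbitrary assumptions) and shows how the reported t statistic follows from r and n, which is why even a very small r can be significant with 6497 rows.

```r
set.seed(42)
# Simulated data, not the wine measurements: y depends weakly on x.
n <- 500
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

res <- cor.test(x, y)   # Pearson's product-moment correlation by default

res$estimate   # sample correlation r
res$conf.int   # 95 percent confidence interval for r
res$p.value

# The t statistic is r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df.
r <- unname(res$estimate)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual
```

Plugging r = -0.03698 and n = 6497 into the same formula gives roughly -2.98, matching the residual sugar vs quality output above.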

The first graph above shows sulphates vs quality. The second graph shows pH vs quality. There looks to be little correlation between them.

The third graph is 9 plots for each quality with pH vs sulphates. Looking at each of these plots it is hard to tell if either of these factor into quality.

The five plots above show quality vs 5 different variables: free sulfur dioxide, chlorides, total sulfur dioxide, density and volatile acidity. Looking at the graphs it is hard to tell if there are correlations. I am going to take a closer look at density and chlorides, as those appear to have some correlation with higher quality.

In the plots above we are looking at a bar graph of the median chlorides by quality and a scatter plot of density similar to the earlier one. The median density does appear to go down as quality goes up. The chlorides bar graph shows that median chlorides also go down as quality goes up.
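The medians behind a bar graph like this can be computed with a grouped summary. This is a sketch on made-up rows, using base R aggregate() as one possible approach rather than the exact code behind the plot.

```r
# Made-up rows; only the column names follow the data set.
wine_df <- data.frame(
  quality   = c(5, 5, 5, 6, 6, 7, 7, 7),
  chlorides = c(0.08, 0.06, 0.07, 0.05, 0.045, 0.03, 0.04, 0.035)
)

# One row per quality score with the median chlorides in that group.
med <- aggregate(chlorides ~ quality, data = wine_df, FUN = median)
med
```

Passing med$chlorides to barplot() with names.arg = med$quality would then draw the bars.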

## 
##  Pearson's product-moment correlation
## 
## data:  wine_df$quality and wine_df$chlorides
## t = -16.508, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2238898 -0.1772134
## sample estimates:
##        cor 
## -0.2006655

The last bivariate graph is a facet wrap of alcohol vs chlorides for each quality score. If you look closely you can see that as the quality gets higher the points shift toward the y axis (lower chlorides) and higher up (higher alcohol content).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

After reviewing alcohol content and residual sugar I determined that there was a positive correlation between alcohol content and quality. I also discovered that pH is negatively correlated with the acidity fields. After reviewing the other features I found that chlorides and density have a small negative correlation with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I observed that pH has a negative correlation with the acidity features. This makes sense because a lower pH indicates higher acidity.

What was the strongest relationship you found?

The strongest relationships I found with quality were alcohol content (r = 0.44), followed by chlorides (r = -0.20) and density.

Multivariate Plots Section

The first plot is chlorides vs alcohol content with the colors set to the quality. You can see that quality increases as chlorides get lower and alcohol content gets higher.

The plot above shows average quality by alcohol content for red and white wines. You can see clearly that for both white and red wines higher alcohol content is correlated with higher quality.

The above is a ggpairs plot of the variables we have not explored much so far. We already know we want to look at chlorides and alcohol. Based on this graph, the largest correlation with quality is with density, so we will explore density a bit more.

The above graph shows density vs chlorides. It looks like density and chlorides have a positive correlation (which is consistent with both having a negative correlation with quality). There are 3 plots, one for each alcohol level. As the alcohol level gets higher the points shift down and turn more green. This means that alcohol level is negatively correlated with density and chlorides. It also means that higher alcohol levels correlate with higher quality.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine_df)
## m2: lm(formula = I(quality) ~ I(alcohol) + chlorides, data = wine_df)
## m3: lm(formula = I(quality) ~ I(alcohol) + chlorides + density, data = wine_df)
## m4: lm(formula = I(quality) ~ I(alcohol) + density, data = wine_df)
## 
## ==========================================================================
##                        m1            m2            m3            m4       
## --------------------------------------------------------------------------
##   (Intercept)         2.405***      2.717***     -7.179         2.810     
##                      (0.086)       (0.094)       (4.644)       (4.512)    
##   I(alcohol)          0.325***      0.308***      0.324***      0.325***  
##                      (0.008)       (0.008)       (0.011)       (0.011)    
##   chlorides                        -2.309***     -2.476***                
##                                    (0.285)       (0.296)                  
##   density                                         9.793*       -0.399     
##                                                  (4.595)       (4.454)    
## --------------------------------------------------------------------------
##   R-squared           0.197         0.205         0.206         0.197     
##   adj. R-squared      0.197         0.205         0.206         0.197     
##   sigma               0.782         0.779         0.778         0.782     
##   F                1597.641       839.499       561.486       798.702     
##   p                   0.000         0.000         0.000         0.000     
##   Log-likelihood  -7623.404     -7590.806     -7588.534     -7623.400     
##   Deviance         3975.734      3936.038      3933.286      3975.729     
##   AIC             15252.809     15189.613     15187.068     15254.801     
##   BIC             15273.146     15216.729     15220.964     15281.917     
##   N                6497          6497          6497          6497         
## ==========================================================================

I created 4 models: one with quality vs alcohol content (m1); one with quality vs alcohol content and chlorides (m2); one with quality vs alcohol content, chlorides and density (m3); and one with quality vs alcohol content and density (m4). The model combining all 3 variables (m3) has the highest R^2 (0.206) and the lowest AIC, although BIC slightly favors m2. Density adds little compared to chlorides: adding density alone (m4) leaves R^2 at 0.197, while adding chlorides (m2) raises it to 0.205.
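To make the comparison criteria concrete, here is a small sketch that fits two nested linear models and tabulates R^2, adjusted R^2, AIC and BIC. The data are simulated (sim_df and the coefficients used to generate it are assumptions for illustration), so the numbers are unrelated to the table above.

```r
set.seed(7)
n <- 300
# Simulated predictors and response, NOT the wine data.
alcohol   <- rnorm(n, mean = 10.5, sd = 1.2)
chlorides <- rnorm(n, mean = 0.056, sd = 0.03)
quality   <- 2.4 + 0.32 * alcohol - 2.3 * chlorides + rnorm(n, sd = 0.8)
sim_df <- data.frame(alcohol, chlorides, quality)

m1 <- lm(quality ~ alcohol, data = sim_df)
m2 <- lm(quality ~ alcohol + chlorides, data = sim_df)

# Plain R-squared can never decrease when a predictor is added, whereas
# adjusted R-squared, AIC and BIC penalise the extra parameter.
comparison <- data.frame(
  model  = c("m1", "m2"),
  r2     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC    = AIC(m1, m2)$AIC,
  BIC    = BIC(m1, m2)$BIC
)
comparison

# F-test for whether the added term improves the nested model.
anova(m1, m2)
```

Since R^2 alone always favors the larger model, the penalised criteria are the more informative columns when choosing between models like m2 and m3.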

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

After reviewing some plots around alcohol content, chlorides and quality it is clear that alcohol content has the strongest correlation to quality.

Were there any interesting or surprising interactions between features?

I noticed that density and chlorides were positively correlated.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The model I created explains only 20.6% of the variance. Adding density improved the R^2 value by only 0.1 percentage points (from 0.205 to 0.206). This makes some sense since density was positively correlated with chlorides.

This model does not explain much of the variance in wine quality and would not be a good model to predict quality.


Final Plots and Summary

Plot One

Description One

This was a plot of red wine vs white wine frequency by quality. I scaled the graph by the number of white and red wines, which showed that white wines on average are rated higher than reds.

It is the basis for the statistical test on quality vs color that I did in the univariate section.

Plot Two

Description Two

The second plot is a plot of average quality vs chloride level. This plot confirmed the negative correlation between quality and chloride level and is the basis for why I added the chloride feature to my model in the multi-variate analysis section.

Plot Three

Description Three

The last plot is where I bring together the three variables I suspect are correlated with quality and graph them in one image. You can clearly see that quality increases as alcohol level increases. You can also see the negative correlation that chlorides and density have with quality, as the points shift down as you move across each alcohol level.


Reflection

Overall I was able to determine that on average white wines are rated higher in quality than red wines. I was also able to determine that alcohol content correlates positively with quality, while chlorides and density correlate negatively with quality.

I was able to use different types of plots to show this as well as create a linear model that helps determine quality based on these variables.

It was a struggle to find features that help determine quality. A lot of the variables seemed to have very little impact or correlations. I tried to transform some of the variables but they still did not seem to correlate very well.

I think if I were to do further analysis I would try to transform more of the variables with square roots, or research other transformations I could apply. It would also be more interesting if the quality variable were a mean rather than a median, or if it ranged from 0 to 100 instead of 0 to 10.